Exploration of Red Wine Quality by Kyungwon Chun

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating from 0 (very bad) to 10 (very excellent).

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The middle 50% of the wines have fixed acidity between 7.10 and 9.20.

The volatile acidity shows a bimodal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The middle 50% of the wines have volatile acidity between 0.39 and 0.64.

The residual sugar shows a right-skewed, long-tailed distribution.

The chlorides show a right-skewed, long-tailed distribution.
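A quick numeric check of this kind of skew is to compare the mean with the median: a mean well above the median points to a long right tail. The sketch below uses only the first ten `residual.sugar` values printed by `str()` above, so it illustrates the idea rather than reproducing the full-data result; the variable names are illustrative.

```r
# First ten residual.sugar values, copied from the str() output above
# (a small sample for illustration, not the full 1,599 rows).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# A mean above the median suggests a long right tail.
skew_gap <- mean(residual_sugar) - median(residual_sugar)

# A log10 transform compresses the tail; all values must be positive.
log_sugar <- log10(residual_sugar)

print(skew_gap)
print(summary(log_sugar))
```

On the full data, plotting the histogram with a log10 x-axis (for example with `scale_x_log10()` in ggplot2) is a common way to look at such a distribution.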

The total sulfur dioxide has some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The middle 50% of the wines have a density between 0.9956 and 0.9978.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The middle 50% of the wines have a pH between 3.210 and 3.400.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Most of the wines have a quality of 5 or 6.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). X identifies each wine, and quality represents how good the wine is. X and quality are, respectively, unordered and ordered factor variables, but I treated them as numerical variables for convenience. All other features represent chemical properties of the wine.
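For reference, quality could instead be stored as an ordered factor. The following is a minimal sketch of that alternative, not what was done in this analysis; it uses the first ten quality ratings from the `str()` output above, and the name `quality_f` is illustrative.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declare the full 0-10 rating scale so that unused levels keep their order.
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)

# Ordered factors support comparisons, but not arithmetic such as mean().
print(table(quality_f))
print(quality_f[4] > quality_f[1])  # compares a rating of 6 with a rating of 5
```

Keeping quality numeric, as done here, makes it possible to compute means and Pearson correlations directly, at the cost of treating the gaps between ratings as equal.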

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Other observations:

* Wines with quality 5 or 6 are most common.
* The median wine quality is 6.
* Most wines have a quality of 5 or better.
* At least 75% of wines have a quality of 6 or worse.
* The worst and best qualities in the data set are 3 and 8, respectively.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to determine which features are best for predicting the wine quality. I suspect that some combination of the other variables can be used to build a predictive model for wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The primary wine characteristics are sweetness, acidity, tannin, alcohol, and body. Residual sugar, fixed and volatile acidity, alcohol, and density relate to most of those characteristics. I expect these variables to be the ones most closely related to wine quality.

Did you create any new variables from existing variables in the dataset?

I created a variable for the total acidity using the volatile and the fixed acids.
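A minimal sketch of how such a variable could be built is shown below. It assumes that total acidity is the plain sum of the two columns, and the column name `total.acidity` is hypothetical. The values are the first ten rows from the `str()` output above.

```r
# First ten rows of the two acidity columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Assumption: total acidity is the sum of fixed and volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

print(head(wine, 3))
```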

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Volatile acidity shows a bimodal distribution.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

Fixed acidity and volatile acidity have strong positive (0.67) and negative (-0.55) correlations with citric acid, respectively.

The pH has a strong negative correlation with fixed acidity (-0.68) and citric acid (-0.54), but not with volatile acidity (0.23).

The fixed acidity and alcohol have significant positive and negative correlations with density, respectively.

Most of the variables do not seem to have strong correlations with quality, but alcohol and volatile acidity have considerable positive and negative correlation with quality, respectively.
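A matrix like the one above can be produced with `cor()` and `round()`. The sketch below is an assumption about how that could be done, run on only the first ten rows of four columns from the `str()` output, so its numbers will differ from the full-data matrix. The names `cor_mat` and `ranked` are illustrative.

```r
# First ten rows of four columns, copied from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded to two decimals as in the report.
cor_mat <- round(cor(wine, method = "pearson"), 2)
print(cor_mat)

# Correlation of each other variable with quality, largest magnitude first.
others <- setdiff(colnames(cor_mat), "quality")
quality_cor <- cor_mat[others, "quality"]
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(ranked)
```

Ranking by absolute value keeps strong negative correlations, such as the one with volatile acidity in the full data, from being overlooked.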

ggplot(wqr, aes(x=fixed.acidity, y=pH)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

The strongest correlation in this data set appears between fixed acidity and pH (-0.68). High acidity means low pH, and the graph is consistent with this.
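The direction of the fitted line drawn by `stat_smooth(method='lm')` can also be read off numerically from a linear model. This is a small sketch on the first ten rows from the `str()` output only, so the slope is illustrative rather than the full-data estimate.

```r
# First ten rows of fixed.acidity and pH, copied from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Simple linear regression of pH on fixed acidity.
fit <- lm(pH ~ fixed.acidity, data = wine)
slope <- coef(fit)[["fixed.acidity"]]

# Pearson correlation for the same pair of variables.
r <- cor(wine$fixed.acidity, wine$pH)

print(c(slope = slope, r = r))
```

A negative slope and a negative correlation coefficient both express the same thing as the downward trend in the scatter plot.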

ggplot(wqr, aes(x=fixed.acidity, y=citric.acid)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

ggplot(wqr, aes(x=fixed.acidity, y=density)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

The fixed acidity has strong positive correlations with citric acid and density, too.

ggplot(wqr, aes(x=citric.acid, y=volatile.acidity)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

ggplot(wqr, aes(x=citric.acid, y=pH)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

The citric acid has considerable negative correlations with volatile acidity and pH.

ggplot(wqr, aes(x=alcohol, y=density)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

Alcohol and density also show a considerable negative correlation.

ggplot(wqr, aes(x=quality, y=volatile.acidity)) +
  geom_jitter(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

ggplot(wqr, aes(x=quality, y=alcohol)) +
  geom_jitter(alpha = 0.3, size = 2) +
  stat_smooth(method='lm')

Two variables, volatile acidity and alcohol, have considerable correlations with quality.

ggplot(wqr, aes(x=factor(quality), y=alcohol)) + 
  geom_boxplot(notch=FALSE)

Among medium- and high-quality wines, alcohol content appears to rise with quality.

ggplot(wqr, aes(x=factor(quality), y=volatile.acidity)) + 
  geom_boxplot(notch=FALSE)

The trend between volatile acidity and quality is clear: better wines tend to have lower volatile acidity.
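The medians that the boxplot displays can be tabulated per quality level with `tapply()`. The sketch below uses only the first ten rows from the `str()` output, where some quality levels have one or two wines, so it shows the mechanics rather than the trend itself.

```r
# First ten rows of volatile.acidity and quality, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median volatile acidity within each quality level (the boxplot center lines).
medians <- tapply(wine$volatile.acidity, wine$quality, median)

# Number of wines behind each median, to judge how reliable it is.
counts <- table(wine$quality)

print(medians)
print(counts)
```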

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The quality correlates with alcohol and volatile acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Citric acid is one of the main components of fixed acidity. As a result, the two have a strong positive correlation.

A higher concentration of fixed acids lowers the pH. Therefore, fixed acidity and citric acid are negatively correlated with pH.

A wine with more volatile acidity tends to have less citric acid.

A wine with more fixed acidity tends to be denser, whereas a wine with more alcohol tends to be less dense.

What was the strongest relationship you found?

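Fixed acidity is strongly and positively correlated with citric acid and with density (0.67 in both cases). Citric acid may therefore substitute for fixed acidity and density, and it may even estimate wine quality better: its correlation with quality is 0.23, compared with 0.12 for fixed acidity and -0.17 for density.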

Multivariate Plots Section

ggplot(wqr, aes(x=alcohol, y=volatile.acidity)) +
  geom_point(alpha = 0.5, size = 2, position = 'jitter', aes(color=quality)) + 
  scale_color_gradient2(midpoint=mean(wqr$quality), low="blue", mid="white", 
                        high="red", space ="Lab" )

ggplot(wqr, aes(x=alcohol, y=1/volatile.acidity)) +
  geom_point(alpha = 0.5, size = 2, position = 'jitter', aes(color=quality)) + 
  scale_color_gradient2(midpoint=mean(wqr$quality), low="blue", mid="white", 
                        high="red", space ="Lab" )

cor.test(wqr$alcohol - wqr$volatile.acidity, wqr$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  wqr$alcohol - wqr$volatile.acidity and wqr$quality
## t = 24.166, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4806396 0.5524750
## sample estimates:
##       cor 
## 0.5174684
cor.test(wqr$alcohol - wqr$volatile.acidity^3 + wqr$citric.acid - wqr$pH + wqr$sulphates, wqr$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  wqr$alcohol - wqr$volatile.acidity^3 + wqr$citric.acid - wqr$pH + wqr$sulphates and wqr$quality
## t = 26.927, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5241297 0.5916089
## sample estimates:
##       cor 
## 0.5587935
cor.test(wqr$fixed.acidity+wqr$citric.acid-wqr$density, wqr$quality)
## 
##  Pearson's product-moment correlation
## 
## data:  wqr$fixed.acidity + wqr$citric.acid - wqr$density and wqr$quality
## t = 5.6008, df = 1597, p-value = 2.509e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0903879 0.1865460
## sample estimates:
##       cor 
## 0.1387941
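One caveat with combining variables by plain addition and subtraction is scale: alcohol ranges from 8.4 to 14.9 while volatile acidity ranges from 0.12 to 1.58, so alcohol dominates the raw difference. A possible variation, sketched below under that assumption and not part of the analysis above, is to standardize each variable with `scale()` before combining. The data are the first ten rows from the `str()` output, so the resulting correlations are illustrative only.

```r
# First ten rows, copied from the str() output above.
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Raw composite, as used in the first cor.test() call above.
raw_score <- alcohol - volatile_acidity

# Standardized composite: scale() centers each variable and divides by its
# standard deviation; as.numeric() drops the matrix attributes it returns.
z_score <- as.numeric(scale(alcohol)) - as.numeric(scale(volatile_acidity))

cors <- c(
  raw          = cor(raw_score, quality),
  standardized = cor(z_score, quality)
)
print(cors)
```

With standardized inputs, each variable contributes to the composite in proportion to its spread rather than its units.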

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection
